Unsupervised Learning

Malo Jan & Luis Sattelmayer

2025-01-16

Outline

  1. What is unsupervised learning?
  2. Topic Modeling
  3. Word Embeddings

Unsupervised Learning

What is unsupervised learning?

  • Unsupervised learning is a type of machine learning that is used to find hidden patterns in data
  • Unlike supervised learning, unsupervised learning does not require labeled data, i.e. no human intervention (at first)
  • Example applications: clustering, dimensionality reduction, anomaly detection, and topic modeling

Clustering & Dimensionality Reduction

  • Clustering is a type of unsupervised learning that is used to group similar data points together
    • kNN, k-means, hierarchical clustering
    • relies on assumption that data points relate to latent groups
  • Dimensionality reduction is a technique used to reduce the number of features in a dataset while preserving the most important information
    • PCA, t-SNE, UMAP
    • relies on assumption that data points lie on a lower-dimensional manifold

Unsupervised Learning as induction

  • Inductive reasoning starts with specific observations (e.g., raw data points) and moves toward a broader understanding or model of the data
  • Unsupervised learning is a form of inductive reasoning that is used to discover patterns in data without human intervention
  • A posteriori, we have to interpret the unsupervised classification(s)
    • clustering: what do the clusters represent?
    • dimensionality reduction: what are the most important features?

How does this relate to ML?

  • Unsupervised learning is a crucial part of the machine learning pipeline
    • preprocess data, identify patterns, and reduce the dimensionality of data
  • Unsupervised learning is often used in combination with supervised learning to improve the performance of predictive models
    • topic modeling can be used to identify topics in a collection of documents, which can then be used as labels in a supervised learning model
    • word embeddings can be used to represent words as dense vectors, which can then be used as input to a supervised learning model

Topic Modeling

What is topic modeling?

  • Topic modeling is a type of unsupervised learning that is used to discover hidden patterns in a collection of documents
  • It is used to identify topics that frequently occur in a collection of documents
  • The most common topic modeling technique is Latent Dirichlet Allocation (LDA) (Blei, Ng, and Jordan 2003)

Latent Dirichlet Allocation (LDA)

  • LDA is a generative probabilistic model that assumes that
    1. each document is a mixture of topics
    2. each topic is a distribution over words
  • Goal:infer the topics that best explain the observed documents
  • LDA widely used in text mining, information retrieval, and natural language processing

LDA

Structural Topic Modeling

  • Structural Topic Modeling (STM) (Roberts et al. 2014) is an extension of LDA that incorporates document metadata

  • It inclusion of covariates in the model, which can help to explain the relationship between topics and document metadata

  • If you have when metadata critical to your research questions (understanding how topics vary across groups, over time, or in relation to specific attributes)

LDA as a BoW approach

  • Topic models still remains a Bag of Words approach
  • It ignores:
    • word order
    • context
    • syntax
  • LDA enhances the basic BoW model by assuming that documents are mixtures of topics, and topics are distributions over words

Word Embeddings

What are word embeddings?

  • ”You shall know a word by the company it keeps” (Firth 1957)
  • Word embeddings are dense vector representations of words in a high-dimensional space
  • They are learned from large text corpora using unsupervised learning techniques
  • Words are represented as “ordered sequences: the occurrence of a given word(s) is used to predict the occurrence of the next one(s)” (Rodriguez and Spirling 2022)

  • To be simple, imagine a low-dimensional space where words are distributed according to their semantic similarity: the embeddings are their coordinates
  • Words that have a similar meaning will have similar vectors, possible to measure distance, similarity and basic arithmetic: King - Man + Woman = Queen
  • Breakthrough in NLP in 2013 estimate of word embeddings from neural networks with word2vec (Mikolov et al. 2013)
    • Predict occurence of a word given a context window: How likely is it that legislator will appear in the same context as lawmaker?

Difference to BoW

  • Bag of Words model:
    • Sparse vector representation
    • Each word is represented by a one-hot vector
    • No information about the context of the word
  • Word embeddings:
    • Dense vector representation
    • Capture semantic relationships between words
    • Words with similar meanings are close in the embedding space

Sparse one-hot vs. dense vectors

Difference between sparse and dense matrices

Dense vectors

  • Distributional hypothesis : words that appear in a similar context tend to have a similar meaning: eg. lawmaker, legislator, politician
  • Dense vectors are more efficient than BoW
  • Rationale: represent words in ~300 dimensions
  • Advantages:
    • Capture semantic relationships and process synonyms
    • Reduce dimensionality
    • Improve performance/efficiency
    • Transfer learning: pre-trained embeddings can be used in other tasks

Cosine Similarity

  • Cosine similarity: inner product of two non-zero vectors (basically taking the cosine of the angle between two vectors)
    • \([-1; 1]\) where \(-1\) = no similarity, \(1\) = same word, \(0\) = independence (orthogonal)
  • Thus, we can place words and their similarity with the help of vectors:

Word embeddings in practice

Three main algorithms

Neuralgic points of word embeddings

  1. How much context is really context?
  2. How many dimensions?
  3. Customize corpus or pre-trained corpora?
  4. What modeling approach?
  5. How to evaluate the embeddings?

WE decision tree

Textual biases working against WEs

  • Bias: although used to detect biases and social structure, word embeddings can also reflect and reinforce said biases
    • GloVe is trained on Wikipedia data, but Wikipedia is not free of bias: e.g. gender bias in article authorship or biographical representations (Wagner et al. 2016)
  • This Table illustrates racial bias in word embeddings (taken from a by Speer (2017))
Phrase Sentiment Score
Let’s go get Italian food 2.0429166109408983
Let’s go get Chinese food 1.4094033658140972
Let’s go get Mexican food 0.38801985560121732
My name is Emily 2.2286179364745311
My name is Heather 1.3976291151079159
My name is Yvette 0.98463802132985556
My name is Shaniqua -0.47048131775890656

Applications of word embeddings

  • However, these biases can also intentionally be used for research purposes
  • Word embeddings are mathematical representations of words
  • Corpora, as mentioned in the first session, are collections of texts created by social actors in a society
    • thus, corpora reflect social structure and dynamics
  • Word embeddings can be used to analyze these biases

Garg et al. (2018)

Garg et al. (2018)

Garg et al. (2018)

Shortcomings

  • One step away from BoW
  • The embeddings are static: words have a unique meaning regardless of their context
  • For instance, ”climate” will still have the same representation in different contexts
    1. ”Foreign investors know what a favourable climate the United Kingdom provides for manufacturing enterprise.”
    2. ”The Prime Minister signed the Rio convention on climate change last June.”
  • Word order still does not matter
    1. ”The government serves the people.”
    2. ”The people serve the government.”
  • Dimensonality and embedding might be hard to understand

End of Unsupervised models

Named Entity Recognition (NER)

  • Attention: NER does not count as an unsupervised approach!

  • They are trained on labeled data (human intervention)

  • But the method is in my Top 3 of ML methods

Named Entity Recognition (NER)

  • An NLP task to identify and classify entities in text
  • Depends on the NER model but here are the typical entities they detect:
Entity Type Examples
Person/Names John Doe
Organizations United Nations, Google
Locations Paris, Mount Everest
Dates/Times January 15, 2025
Miscellaneous Product names, monetary values, percentages

Named Entity Recignition (NER)

Use Cases

Entity Type Entity n
PER Hitler 25119
PER Sohn 24207
PER Kohl 23472
PER Schröder 19127
PER Merkel 18093
PER Schmidt 15862
PER Helmut Kohl 12800
PER Strauß 12375
PER Brandt 11627
PER Adenauer 11134

Use Cases

References

Blei, David M, Andrew Y Ng, and Michael I Jordan. 2003. “Latent Dirichlet Allocation.” Journal of Machine Learning Research 3 (Jan): 993–1022.
Bojanowski, Piotr, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. “Enriching Word Vectors with Subword Information.” Transactions of the Association for Computational Linguistics 5: 135–46.
Firth, J. R. 1957. “A Synopsis of Linguistic Theory, 1930-1955.” Studies in Linguistic Analysis, 10–32. https://cir.nii.ac.jp/crid/1574231874045325568.
Garg, Nikhil, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. “Word Embeddings Quantify 100 Years of Gender and Ethnic Stereotypes.” Proceedings of the National Academy of Sciences 115 (16): E3635–44.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. “Distributed Representations of Words and Phrases and Their Compositionality.” Advances in Neural Information Processing Systems 26.
Pennington, Jeffrey, Richard Socher, and Christopher D Manning. 2014. “Glove: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43.
Roberts, Margaret E, Brandon M Stewart, Dustin Tingley, Christopher Lucas, Jetson Leder-Luis, Shana Kushner Gadarian, Bethany Albertson, and David G Rand. 2014. “Structural Topic Models for Open-Ended Survey Responses.” American Journal of Political Science 58 (4): 1064–82.
Rodriguez, Pedro L, and Arthur Spirling. 2022. “Word Embeddings: What Works, What Doesn’t, and How to Tell the Difference for Applied Research.” The Journal of Politics 84 (1): 101–15.
Wagner, Claudia, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. “Women Through the Glass Ceiling: Gender Asymmetries in Wikipedia.” EPJ Data Science 5: 1–24.